Writer Identification in Modern and Historical Documents via Binary Pixel Patterns, Kolmogorov–Smirnov Test and Fisher's Method
Authors: Arie Shaus and Eli Turkel
Abstract
The authors present a new method for writer identification, employing the full statistical power of multiple experiments, which yields statistically significant results. Each individual binarized and segmented character is represented as a histogram of 512 binary pixel patterns—3 × 3 black and white patches. When comparing two given inscriptions under a "single author" assumption, the algorithm performs a Kolmogorov–Smirnov test for each letter and each patch. The resulting p-values are combined using Fisher's method, producing a single p-value. Experiments on both Modern and Ancient Hebrew data sets demonstrate the excellent performance and robustness of this approach.

© 2017 Society for Imaging Science and Technology. [DOI: 10.2352/J.ImagingSci.Technol.2017.61.1.010404]
Received July 18, 2016; accepted for publication Nov. 12, 2016; published online Dec. 8, 2016. Associate Editor: Zeev Zalevsky.

INTRODUCTION

The current article deals with the challenging task of writer identification in historical documents. In what follows, we provide a short overview of the existing approaches to this task, and present the main contribution of this article.

Prior Work

The problem of computerized writer identification within historical documents has existed in the literature for several decades.1 Several features, and methods of combining them, have been proposed for that purpose. The article Ref. 1 uses run-length histograms, combined via PCA (first two components). Ref. 2 continues the use of run-length distributions, supplementing them with allographic features (a grapheme codebook generated using a self-organizing map); the feature fusion is performed via simple or weighted averaging of the distances due to the individual features. Similar allographic features ("fraglets"), optionally supplemented with an edge-directional feature ("hinge"), are present in Ref. 3, with Hamming distance measures between the normalized features. The article Ref. 4 presents another feature combination technique, extracting 8 types of features pertaining to various relations between foreground and background pixels of segmented characters, as well as their central moments. The features are selected via dimension reduction techniques such as sequential forward floating selection and linear discriminant analysis, classifying the reduced feature vectors via a linear Bayes classifier or K-nearest neighbors. Yet another set of classifiers, based on grid microstructure, allograph-level and topological features, combined via a weighting procedure, is presented in Ref. 5. Ref. 6 provides a wealth of contour-based, oriented basic image, as well as SIFT, features, classified by a voting procedure and SVM; Ref. 7 uses a similar setup, adding HOG features. An adaptation of SIFT features is also used in Ref. 8, with dimensionality reduced via PCA, resulting in a visual vocabulary. The features are clustered using a Gaussian Mixture Model and employing the Fisher kernel. A recent use of KDA in a setting involving both chain-code and edge-based directional features can be found in Ref. 9.
A thoroughly different approach is demonstrated in Ref. 10, operating on a segmented character level, and treating the characters as realizations of estimated "Ideal Prototypes." The identity of, or distinction between, writers is established via several techniques, employing comparisons of the contours of the realizations to various ideals, and using heuristics and maximum likelihood estimation procedures combining information from different letters in order to find similar writers. A similar method is described in Ref. 11, with the comparisons between character or ideal contours solved analytically.

A review of these articles, as well as surveys of the broader field of writer identification,12,13 demonstrates the common denominator of most of these algorithms: a series of features (e.g., based on edge, allograph or topological information, or using "classical" computer vision features such as SIFT, HOG and Gabor filters) is extracted. Optionally, the dimensionality is reduced (e.g., via weighting, LDA, PCA or KDA methods), followed by writer classification of the resulting feature vectors (e.g., by employing KNN, SVM, MLE or the Fisher kernel). Usually, the question is whether a given document, according to some metric, was written by the same author as the most closely matching document. Alternatively, several (e.g., 5 or 10) "closest" documents are fetched for the purpose of identifying at least one identical writer. The algorithm's performance is checked against an existing ground truth.

Although some of these methods perform reasonably for their tasks and data sets, their typical output is an abstract distance between two given inscriptions, or else a table indicating the distances between several inscriptions. However, these distances do not yield any probabilistic information. Thus, it is difficult to interpret such an output outside a well ground-truthed framework. In particular, the distances, by themselves, are insufficient for the different task of analyzing a corpus of many inscriptions, with an unknown number of authors.

The existing approaches can be contrasted with the direct predecessor of this article,14 which proposes a statistical approach. That article used a sophisticated concatenation and subsequent combination of SIFT, Zernike, DCT, Kd-tree, image projection, CMI15–17 and L1 features and distances. Subsequently, on a pairwise basis between inscriptions, the writer identification analysis was performed independently for each letter. This resulted in different statistical p-values, estimating the probability of a single author producing the different letter instances of the two inscriptions. These independent p-values were later combined into a "meta" p-value via Fisher's method (a brief explanation is provided below), typically resulting in more significant results. Contrary to the existing approaches, such an approach can easily be utilized in order to detect different authors within any given corpus, by detecting "meta" p-values below a certain threshold.

Finally, we rely on previously developed document preprocessing techniques. In particular, we assume the existence of a suitable binarization, segmentation and (if needed) restoration of characters, whose quality is suitable for our needs. The inputs for the described method are individual black and white images of single characters, reflecting the original writing as reliably as possible (e.g., not thinned, no slant correction, etc.). Such automatic or semimanual techniques are described, for our data sets, in Refs. 17–19 and especially in Refs. 14, 20; for other approaches, consult the references mentioned in Refs. 4, 10, 11. For methods assessing the adherence of a character's reconstruction to its image, as well as the general quality of the resulting binarization, see Refs. 15, 16, 21, 22. Additional details regarding the preparation of the different data sets are provided below.

The Main Contribution of this Article

In this research, we advance the ideas of Ref. 14 to the next level. The analysis is performed independently, not only on the level of a single letter, but also on the level of a single feature, unleashing the full statistical power of multiple experiments.
The main changes from Ref. 14 are:
• An entirely different, and much larger, set of features (using 512 different binary pixel patterns instead of a combination of 7 features).
• A two-step experimental process, working both on the individual feature level (comparing the feature distributions via the Kolmogorov–Smirnov test) and on the individual letter level, in order to deduce the p-values, later to be combined via Fisher's method (potentially, thousands of experiments, equaling the number of letters multiplied by the number of features, are conducted!).
• An improvement in the significance level of the results, achieved by lowering the p-value threshold.

All these allow us to establish a robust platform for analyzing corpora of many inscriptions, with an unknown number of authors, while arriving at meaningful and statistically highly significant outcomes. A schematic comparison of the various handwriting analysis schemes is presented in Figure 1.

Figure 1. A comparison of handwriting analysis schemes. Left: common frameworks, producing an abstract distance between the documents as a final output. Center: the method of Ref. 14, performing the analysis on a per-letter basis, yielding (number of letters) experimental p-values to be combined via Fisher's method. Right: the current technique, performing Kolmogorov–Smirnov tests for each feature and each letter, yielding (number of features) × (number of letters) experimental p-values to be combined via Fisher's method.

ALGORITHM'S DESCRIPTION

Preliminary Remarks

We use the common statistical convention of defining a "null hypothesis" H0 and trying to disprove it. In our case, H0 is "the two given inscriptions were written by the same author." The probability of this event is the p-value, which will be estimated via the algorithm. If the p-value is lower than a pre-defined threshold, H0 is rejected, and the competing hypothesis of "two different authors" is declared valid. On the other hand, an inability to reject the null hypothesis does not indicate its validity. In such a case we remain agnostic, not being able to say anything regarding the documents' authors.

The estimation of the p-value involves the application of the Kolmogorov–Smirnov (KS) test, a classical nonparametric test allowing for a comparison of two samples, not necessarily of the same size.23 The main idea of KS is a comparison of the empirical distribution functions $F_1$ and $F_2$ (produced from the two samples), in order to calculate the observed statistic $D = \sup_x |F_1(x) - F_2(x)|$. The p-value of this statistic, under the hypothesis that the two samples stem from the same distribution, can be either calculated directly (via permutations) or approximated (our research utilizes the implementation of Ref. 24). For example, if the samples' sizes are large enough, and all the values within the first sample are smaller than the values of the second sample, the p-value should be low. A previous usage of the Kolmogorov–Smirnov test in a signature verification setting can be seen in Ref. 25.
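To make the comparison concrete, the following minimal Python sketch (illustrative only; the paper does not specify an implementation language) applies SciPy's standard two-sample KS test to two hypothetical samples of patch frequencies. All numbers are invented for demonstration.

```python
# A minimal sketch of the two-sample KS comparison described above, using
# SciPy's standard implementation. The sample values are hypothetical patch
# frequencies, invented purely for illustration.
from scipy.stats import ks_2samp

sample1 = [0.02, 0.03, 0.04, 0.05]  # frequencies of one patch across characters of inscription I
sample2 = [0.09, 0.10, 0.12]        # frequencies of the same patch across characters of inscription J

result = ks_2samp(sample1, sample2)
# D = sup_x |F1(x) - F2(x)|; here the two samples do not overlap at all,
# so D = 1 and the resulting p-value is correspondingly low.
print(result.statistic, result.pvalue)
```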
Another well-established technique used by the algorithm is Fisher's method for p-value combination.26 Given p-values $p_i$ ($i = 1, \ldots, k$) stemming from $k$ independent experiments, the method allows one to estimate a combined p-value, reflecting the entire wealth of evidence at our disposal. The method utilizes the fact that $-2 \sum_{i=1}^{k} \ln(p_i) \sim \chi^2_{2k}$, i.e., the sum follows a chi-squared distribution with $2k$ degrees of freedom. This allows for the calculation of a single combined ("meta") p-value. Intuitively, if several experiments produce low p-values (e.g., 0.1, 0.15 and 0.2), the probability of such an occurrence arising by chance is very small, and the combined p-value will also be low (possibly even lower than the original p-values; 0.071 for the last example). However, in the current article, the p-values of the multiple experiments (stemming from different characters and features) are not necessarily independent, but are expected to be positively correlated. Thus, we are "overconfident" in the combined evidence against H0. A common remedy to this problem is to demand more significant results, by substituting the threshold $T$ with $T \cdot (k+1)/(2k)$ (where $k$ is the number of experiments), a common modification representing a mean of false discovery rates.27 In our case, this demand can be satisfied simply by lowering the threshold p-value $T$ from 0.2 (as in Ref. 14) to 0.1 or even 0.05.
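The worked example above can be verified with a few lines of Python; the function below is a straightforward transcription of the chi-squared relation (SciPy also offers an equivalent built-in, scipy.stats.combine_pvalues with method='fisher').

```python
# A minimal sketch of Fisher's method, reproducing the worked example above:
# the p-values 0.1, 0.15 and 0.2 combine to approximately 0.071.
import numpy as np
from scipy.stats import chi2

def fisher_method(pvals):
    """Combine p-values: under H0, -2 * sum(ln p_i) ~ chi^2 with 2k degrees of freedom."""
    pvals = np.asarray(pvals, dtype=float)
    statistic = -2.0 * np.sum(np.log(pvals))
    return chi2.sf(statistic, df=2 * len(pvals))

print(fisher_method([0.1, 0.15, 0.2]))  # ~0.071
```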
Prior Assumptions

We begin with two images of different inscriptions, denoted as $I$ and $J$. The algorithm operates on information derived at the character level. Herein, by a character we denote a particular instance of a given letter (e.g., there may be many characters, all of which are instances of the letter alep). As remarked above, we assume that the inscriptions' characters are binarized and segmented into images $I^l_{i_l}$ ($i_l = 1, \ldots, M_l$, representing the instances of the letter $l$ within $I$) and $J^l_{j_l}$ ($j_l = 1, \ldots, N_l$, representing the instances of the same letter $l$ within $J$), belonging to the appropriate letters ($l = 1, \ldots, L$). In the current research, the binarization and segmentation was performed automatically for the Modern Hebrew documents, and in semimanual fashion for the Ancient Hebrew documents.14,20 The resulting character images were padded with a 1-pixel white border on each side.

Histogram Creation for each Character

Our features are the 3 × 3 binary pixel patterns, i.e., image patches of the individual characters (for additional information on pixel patterns, see the examples in Refs. 28, 29). There are $2^9 = 512$ possible patches of that size. All such possible patches are extracted from the images $I^l_{i_l}$ and $J^l_{j_l}$, in order to create normalized patch histograms (counting the frequencies of patch occurrences), $H^l_{i_l}(p)$ and $G^l_{j_l}(p)$, respectively ($p = 1, \ldots, 512$). A simple, yet illustrative, example of two such images and their respective histograms is seen in Table I. Remarkably, despite a similar overall shape of the character, and a difference of only two pixels between the character images, 16 out of 19 meaningful histogram entries are different.

Table I. Example of character histograms.

We note that the histograms serve normalization purposes only. In what follows, the histograms themselves will not be compared. Instead, the comparison will take place on an individual feature (patch) level, across different characters.
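As an illustration of this feature extraction step, here is a minimal Python sketch (an assumption of this presentation, not the authors' code) that encodes each 3 × 3 binary patch as a 9-bit integer and accumulates the normalized 512-bin histogram for a single character image.

```python
# A minimal sketch of the 512-bin patch histogram for one character image.
# `char_img` is assumed to be a 2D numpy array of 0s and 1s, already padded
# with a 1-pixel white border, as described above.
import numpy as np

def patch_histogram(char_img):
    h, w = char_img.shape
    weights = (2 ** np.arange(9)).reshape(3, 3)  # a fixed bit position for each of the 9 cells
    counts = np.zeros(512)
    for y in range(h - 2):
        for x in range(w - 2):
            # Encode the 3x3 patch as an integer in [0, 511] and count it.
            code = int(np.sum(char_img[y:y + 3, x:x + 3] * weights))
            counts[code] += 1
    return counts / counts.sum()  # normalized frequencies, one entry per pattern
```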
Same-Writer Statistics Derivation

The experiments are performed in the following fashion: for given inscription images $I$ and $J$ with $I \neq J$:

1. An empty PVALS array is initialized.
2. For each letter $l = 1, \ldots, L$ with sufficient character instances present ($M_l > 0$, $N_l > 0$, $M_l + N_l \geq 4$; we verify that there are enough statistics for a meaningful comparison, slightly lowering the requirements of Ref. 14):
   2.1 For each patch $p = 1, \ldots, 512$ with at least one nonzero term present in the histograms (i.e., $\exists i_l : H^l_{i_l}(p) > 0$ or $\exists j_l : G^l_{j_l}(p) > 0$), perform a Kolmogorov–Smirnov (KS) nonparametric test between the two samples $\{H^l_{i_l}(p)\}_{i_l=1}^{M_l}$ and $\{G^l_{j_l}(p)\}_{j_l=1}^{N_l}$: $pval^l_p = KS(\{H^l_{i_l}(p)\}_{i_l=1}^{M_l}, \{G^l_{j_l}(p)\}_{j_l=1}^{N_l})$.
   2.2 Append the resulting $pval^l_p$ to the PVALS array.
3. If the PVALS array is empty (i.e., no experiments were performed due to the scarcity of data), or if $I = J$, set: SameWriterP$(I, J)$ = SameWriterP$(J, I)$ = 1.
4. Otherwise, utilize the Fisher combination of all the PVALS instances, and set: SameWriterP$(I, J)$ = SameWriterP$(J, I)$ = FisherMethod(PVALS).

SameWriterP$(I, J)$ represents the deduced probability of having the same writer for both $I$ and $J$ (the H0 hypothesis). A toy-problem illustration of the whole scheme is shown in Figure 2. In this demonstration, two alep characters and four bet characters are segmented from the first document, while three alep and two bet characters are segmented from the second document. As a first step, patch histograms are extracted from the two documents. For illustration purposes, it is assumed that in both cases, only the first two patches yield a nonzero count. Since two types of relevant features and two different letters are involved, 2 × 2 = 4 Kolmogorov–Smirnov tests are performed, yielding four p-values. These are combined into a single p-value via Fisher's method.

Figure 2. An example of the same-writer statistics derivation for two hypothetical inscriptions. Inscription I consists of two instances of the letter alep and four instances of the letter bet, while Inscription II consists of three instances of the letter alep and two instances of the letter bet. The only patches with enough statistics are patches number 1 and 2. Four comparisons of the appropriate samples (for each letter and each patch) are performed via the Kolmogorov–Smirnov test, resulting in four different p-values. These p-values are then combined via Fisher's method.
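Putting steps 1–4 together, the following Python sketch mirrors the derivation under simple assumed data structures: dictionaries mapping each letter to the list of its per-character 512-entry histograms. These containers, and the function name, are illustrative assumptions, not the authors' implementation.

```python
# A minimal end-to-end sketch of SameWriterP. `hists_I` and `hists_J` map each
# letter l to a list of 512-entry patch histograms, one per character instance
# (an assumed container format, for illustration only).
import numpy as np
from scipy.stats import ks_2samp, chi2

def same_writer_p(hists_I, hists_J):
    pvals = []  # step 1: an empty PVALS array
    for l in hists_I:  # step 2: letters with sufficient character instances
        H = np.array(hists_I[l])
        G = np.array(hists_J.get(l, []))
        if len(H) == 0 or len(G) == 0 or len(H) + len(G) < 4:
            continue
        for p in range(512):  # step 2.1: patches with at least one nonzero term
            if H[:, p].max() > 0 or G[:, p].max() > 0:
                pvals.append(ks_2samp(H[:, p], G[:, p]).pvalue)  # step 2.2
    if not pvals:  # step 3: no experiments could be performed
        return 1.0
    statistic = -2.0 * np.sum(np.log(pvals))  # step 4: Fisher's method
    return chi2.sf(statistic, df=2 * len(pvals))
```

A pair of inscriptions would then be flagged as written by different authors whenever this combined p-value falls below the chosen threshold T (0.1 in the configuration tuned below).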
MODERN HEBREW EXPERIMENT

The Basic Settings

This experiment closely follows the setting described in Ref. 14. The data set (available at Ref. 30) contains a sampling of 18 individuals, $k = 1, \ldots, 18$. Each individual filled in a Modern Hebrew alphabet table, consisting of ten occurrences of each of the 22 letters of the alphabet (the number of letters and their names are the same as in the Ancient Hebrew of the next experiment; see Figure 3 for a table example). These tables were scanned and thresholded in order to create black and white images. Then their characters were segmented, utilizing their known bounding box locations (Fig. 3).

Figure 3. An example of a Modern Hebrew alphabet table, produced by a single writer; taken from Ref. 14.

From this raw data, a series of "simulated" inscriptions was created. Due to the need to test both same-writer and different-writer scenarios, the data for each writer was split. Furthermore, in order to imitate a common situation in the Ancient Hebrew experiment, where scarcity of data is prevalent (see below), each simulated inscription used only three letters (i.e., 15 characters; 5 characters for each letter), presenting a welcome challenge for the new algorithm. In total, 252 inscriptions were "simulated" in the following manner:
• All the letters of the alphabet except for yod (due to its small size) were split randomly into seven groups (three letters in each group), $g = 1, \ldots, 7$: gimel, het, resh; bet, samek, shin; dalet, zayin, ayin; tet, lamed, mem; nun, sade, taw; he, pe, qop; alep, waw, kap.
• For each writer $k$, and each letter belonging to group $g$, five characters were assigned to the simulated inscription $S_{k,g,1}$, with the rest assigned to $S_{k,g,2}$.

In this fashion, for constant $k$ and $g$, we can test whether our algorithm arrives at a wrong rejection for $S_{k,g,1}$ and $S_{k,g,2}$ (FP = "False Positive" error; 18 writers and seven groups produce 126 tests in total). In addition, for constant $g$, writer $q$ s.t. $q \neq k$, and $b, c \in \{1, 2\}$, we can test whether our algorithm fails to correctly reject the "same-writer" hypothesis for $S_{k,g,b}$ and $S_{q,g,c}$ (FN = "False Negative" error; 18 × 17/2 = 153 writer pairs, times four split combinations and seven groups, yield 4284 tests in total).

Parameter Tuning and Robustness Verification

The algorithm described in the Algorithm's Description section provides an estimated probability for the H0 hypothesis ("the two given inscriptions were written by the same writer"). However, two important parameters remain undecided. The first is the typical area of each character in pixels, leading to optimal (or at least acceptable) performance. The second crucial parameter is the p-value threshold $T$, set for the purpose of rejecting H0. As is common in statistics, lowering $T$ can result in fewer FP errors, unfortunately increasing the likelihood of FN errors. Conversely, raising $T$ might result in the opposite outcome.

In order to minimize the FP and FN errors, a set of simulations was performed. The simulations measured the behavior of the sum FP + FN with respect to the area of the character's image (ranging from 200 to 50,000 pixels), and to the chosen value of $T$ (attempting the value 0.2 chosen by Ref. 14, as well as the values 0.1 and 0.05, as explained above). The results of these simulations are shown in Figure 4.

Figure 4. Testing the combined probability of FP + FN errors as a function of character area (in pixels), for the different p-value thresholds 0.2, 0.1 and 0.05. Taking into account the performance in Ref. 14 (FP + FN ≈ 0.043), all the tested thresholds and all the areas between 1000 and 40,000 pixels would yield reasonable and comparable performance. Slightly better results are achieved in the range of 8000–20,000 pixels, with the 0.1 threshold.

Taking into account the performance in Ref. 14 (FP + FN ≈ 0.043), all the tested thresholds and all the areas between 1000 and 40,000 pixels yield a reasonable and comparable performance (FP + FN < 0.05), with slightly better results in the range of 8000–20,000 pixels, with $T = 0.1$. This wide range of acceptable areas indicates an excellent robustness of the algorithm (though the algorithm would probably produce better outcomes if the character images were of similar resolution). Since the mean area of the original character images was 17,367 pixels, well within the reasonable limits of our analysis, we have chosen the typical area of each character to be 17,000 pixels.

Experimental Results

The results of our configuration (for different values of $T$) are provided in Table II. The results are certainly better than the results of Ref. 14 on the same data set, with a much simpler configuration. As expected, the FP error rate tends to zero as the threshold is lowered, while the FN rate increases slightly. The threshold value of $T = 0.1$ produced the better results, with a combined FP + FN error of less than 2%.

Table II. Results of the Modern Hebrew experiment.

Confident in our newly obtained configuration (target area of ∼17,000 pixels and $T = 0.1$), we proceed to the Ancient Hebrew experiment.

ANCIENT HEBREW EXPERIMENT

The Basic Settings

The Ancient Hebrew data set31 stems from the Judahite desert fortress of Arad, dated to the end of the First Temple period (Iron Age), ca. 600 BCE—the eve of Nebuchadnezzar's destruction of Jerusalem. The fortress was unearthed half a century ago, with 100 ostraca (ink on clay) inscriptions found during the excavations.32 The inscriptions represent the correspondence of the local military personnel. See Figure 5 for examples of Arad ostraca.

Figure 5. Examples of ostraca (ink inscriptions on clay) from the Iron Age fortress of Arad, located in arid southern Judah. These documents are dated to the latest phase of the First Temple Period in Judah, ca. 600 BCE. The texts represent correspondence of local military personnel.

We concentrate on 16 (relatively lengthy) Arad ostraca, two of them two-sided, which brings the total number of texts for analysis to 18. The scarcity of data in this situation is common for these ancient texts. Ostraca images were utilized

Table III. Letter statistics for Arad texts (adapted from Ref. 14).